[Minor] add non topk benchmarks for utf8/utf8view string aggregates by buraksenn · Pull Request #21073 · apache/datafusion

buraksenn · 2026-03-20T07:02:33Z

Which issue does this PR close?

Closes #19713 (Add non-TopK benchmark variants for Utf8/Utf8View string aggregates).

Rationale for this change

Details are in #19713, but the main idea is to add non-TopK benchmarks alongside the TopK ones so that their performance can be compared directly.

What changes are included in this PR?

Added non topk benchmark tests.

Are these changes tested?

Only test changes

Are there any user-facing changes?

No

kosiew

👋 @buraksenn

Thanks for working on this.

datafusion/core/benches/topk_aggregate.rs

buraksenn · 2026-03-20T18:37:43Z

Thanks @kosiew for the detailed review

kosiew

@buraksenn

Thanks for the updates here, this is heading in a good direction. I did notice one issue that affects the benchmark results, plus a couple of smaller readability suggestions.

kosiew · 2026-03-23T04:27:52Z

datafusion/core/benches/topk_aggregate.rs

    assert_eq!(batch.num_rows(), LIMIT);

-    Ok(())
+    Ok(format!("{}", pretty_format_batches(&batches)?))


It looks like aggregate_string now pretty prints the collected batches and returns a String, and that is what b.iter(...) is measuring for each string benchmark.

This means the non TopK string benchmarks now include batch formatting and heap allocation overhead, which the numeric benchmarks do not include. So the comparison is no longer isolating query execution.

Would it make sense to keep a Result<()> fast path for the timed benchmark, and move the pretty_format_batches work into a one time validation helper that runs before benchmark registration?

I've changed it so that the formatting is done in the caller path and the function returns a vector.
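To illustrate the concern in this thread, here is a minimal, std-only sketch of keeping formatting out of the timed path. It does not use DataFusion or Criterion; `run_query`, `validate_once`, and `timed_path` are hypothetical names, and the sorted string list merely stands in for the real aggregate query.

```rust
use std::time::Instant;

// Hypothetical stand-in for the aggregate query; the real benchmark executes
// a DataFusion query and collects record batches.
fn run_query(limit: usize) -> Vec<String> {
    let mut rows: Vec<String> = (0..1000).rev().map(|i| format!("trace-{i:04}")).collect();
    rows.sort();
    rows.truncate(limit);
    rows
}

// One-time validation: rendering the output to a String happens here,
// outside the measured loop.
fn validate_once(limit: usize) -> String {
    let rows = run_query(limit);
    assert_eq!(rows.len(), limit);
    rows.join("\n")
}

// Timed path: executes the query and checks the row count, nothing else.
fn timed_path(limit: usize) -> Result<(), String> {
    let rows = run_query(limit);
    if rows.len() != limit {
        return Err(format!("expected {limit} rows, got {}", rows.len()));
    }
    Ok(())
}

fn main() {
    // Validate (and format) once, before any timing starts.
    let rendered = validate_once(3);
    println!("{rendered}");

    // Only the query itself runs inside the measured loop.
    let start = Instant::now();
    for _ in 0..100 {
        timed_path(3).expect("query should return `limit` rows");
    }
    println!("100 timed iterations took {:?}", start.elapsed());
}
```

With this split, the string and numeric benchmarks would measure the same kind of work, since neither pays for rendering the result inside the loop.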

kosiew · 2026-03-23T04:27:52Z

datafusion/core/benches/topk_aggregate.rs

-        .as_str(),
-        |b| b.iter(|| run_string(&rt, ctx.clone(), limit, true)),
-    );
+    for asc in [false, true] {


Small readability thought here. Could the Utf8 vs Utf8View parity check live in a helper?

Right now criterion_benchmark is doing both benchmark registration and cross layout validation. Pulling the verification loop into something like assert_string_results_match(...) would make the benchmark matrix easier to scan.

Instead of a single helper, I've extracted two helpers.
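For readers following along, a parity-check helper of the kind suggested above might look roughly like this. It is a std-only sketch, not the code in the PR: `Layout`, `TRACE_IDS`, and `run_string_query` are hypothetical stand-ins for the two string layouts and the real query, and only the name `assert_string_results_match` comes from the review comment.

```rust
// Hypothetical stand-ins for the two string layouts being compared.
#[derive(Clone, Copy, Debug)]
enum Layout {
    Utf8,
    Utf8View,
}

// Illustrative input data, not taken from the benchmark.
const TRACE_IDS: [&str; 5] = ["d", "a", "e", "b", "c"];

// Mock query: sorts the ids via a layout-specific code path, then applies
// the ordering and limit.
fn run_string_query(layout: Layout, asc: bool, limit: usize) -> Vec<String> {
    let mut rows: Vec<String> = match layout {
        Layout::Utf8 => {
            // Owned strings, sorted directly.
            let mut owned: Vec<String> = TRACE_IDS.iter().map(|s| s.to_string()).collect();
            owned.sort();
            owned
        }
        Layout::Utf8View => {
            // Borrowed views, sorted first and materialized afterwards.
            let mut views: Vec<&str> = TRACE_IDS.to_vec();
            views.sort_unstable();
            views.into_iter().map(String::from).collect()
        }
    };
    if !asc {
        rows.reverse();
    }
    rows.truncate(limit);
    rows
}

// Cross-layout validation kept separate from benchmark registration.
fn assert_string_results_match(asc: bool, limit: usize) {
    let utf8 = run_string_query(Layout::Utf8, asc, limit);
    let view = run_string_query(Layout::Utf8View, asc, limit);
    assert_eq!(utf8, view, "Utf8 and Utf8View results differ (asc={asc}, limit={limit})");
}

fn main() {
    for asc in [false, true] {
        assert_string_results_match(asc, 3);
    }
    println!("layouts agree");
}
```

Keeping the check in its own function leaves the registration code as a plain list of benchmark cases.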

kosiew · 2026-03-23T04:27:52Z

datafusion/core/benches/topk_aggregate.rs

-    );
+    // String aggregate benchmarks
+    // (asc, use_topk, use_view)
+    let string_cases: &[(bool, bool, bool)] = &[


The &[(bool, bool, bool)] tuple list is a bit hard to read and reason about when scanning or extending cases.

Maybe a small case struct or a helper similar to numeric_cases would help make the intent clearer. Something that names asc, use_topk, and use_view explicitly would also reduce the chance of mixing up tuple positions later.

I've made both this and the numeric one structs.
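As a rough sketch of the named-field approach discussed here, a case struct could replace the `(bool, bool, bool)` tuples along these lines. The struct name `StringCase`, the `bench_name` format, and the `string_cases` helper are assumptions for illustration and are not taken from the PR; only the field names `asc`, `use_topk`, and `use_view` come from the review comment.

```rust
// Hypothetical case struct; field names follow the tuple comment in the diff.
#[derive(Clone, Copy, Debug, PartialEq)]
struct StringCase {
    asc: bool,
    use_topk: bool,
    use_view: bool,
}

impl StringCase {
    // Illustrative benchmark label; the real naming scheme may differ.
    fn bench_name(&self) -> String {
        let order = if self.asc { "asc" } else { "desc" };
        let topk = if self.use_topk { "topk" } else { "no-topk" };
        let layout = if self.use_view { "utf8view" } else { "utf8" };
        format!("{order}/{topk}/{layout}")
    }
}

// Builds the full matrix of cases without relying on tuple positions.
fn string_cases() -> Vec<StringCase> {
    let mut cases = Vec::new();
    for asc in [false, true] {
        for use_topk in [false, true] {
            for use_view in [false, true] {
                cases.push(StringCase { asc, use_topk, use_view });
            }
        }
    }
    cases
}

fn main() {
    for case in string_cases() {
        println!("{}", case.bench_name());
    }
}
```

Because each flag is named at the construction site, swapping two of them by accident becomes much harder than with positional tuples.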

buraksenn · 2026-03-23T19:46:36Z

Thanks for the detailed review twice @kosiew, really appreciate it :)

added benchmarks and made separate tests

32464db

github-actions bot added the core Core DataFusion crate label Mar 20, 2026

kosiew requested changes Mar 20, 2026

buraksenn added 2 commits March 20, 2026 21:35

address reviews

f3e6f58

ordering change

f3b83ee

buraksenn changed the title ~~[Minor] add non topk benchmarks for ut8/ut8view string aggregates~~ [Minor] add non topk benchmarks for utf8/utf8view string aggregates Mar 20, 2026

kosiew requested changes Mar 23, 2026

address comments

c3c6d5c

buraksenn wants to merge 4 commits into apache:main from buraksenn:add-non-top-k-benchmarks-to-compare

2 participants
